Question 1

Part A

data <- read.csv('car93.csv')

pca_cars <- prcomp(data[,-1:-3], scale.=TRUE)
print(summary(pca_cars))
## Importance of components:
##                           PC1    PC2     PC3     PC4    PC5    PC6     PC7
## Standard deviation     3.1704 1.2840 0.97225 0.76321 0.6504 0.5732 0.53562
## Proportion of Variance 0.6701 0.1099 0.06302 0.03883 0.0282 0.0219 0.01913
## Cumulative Proportion  0.6701 0.7800 0.84304 0.88187 0.9101 0.9320 0.95110
##                            PC8     PC9    PC10    PC11    PC12    PC13    PC14
## Standard deviation     0.44946 0.41551 0.34330 0.26369 0.26018 0.20809 0.19842
## Proportion of Variance 0.01347 0.01151 0.00786 0.00464 0.00451 0.00289 0.00262
## Cumulative Proportion  0.96457 0.97608 0.98393 0.98857 0.99308 0.99597 0.99859
##                           PC15
## Standard deviation     0.14522
## Proportion of Variance 0.00141
## Cumulative Proportion  1.00000
biplot(pca_cars)

Part B and C

print(pca_cars$rotation[,1:2])
##                           PC1         PC2
## Price               0.2204708  0.37225168
## MPG.city           -0.2686351 -0.19307271
## MPG.highway        -0.2428963 -0.28759336
## EngineSize          0.2993827 -0.05044005
## Horsepower          0.2504771  0.39222489
## RPM                -0.1427112  0.52833268
## Rev.per.mile       -0.2626478  0.12094301
## Fuel.tank.capacity  0.2819184  0.16643949
## Length              0.2911181 -0.11364032
## Wheelbase           0.2890445 -0.10287701
## Width               0.2861280 -0.10943488
## Turn.circle         0.2633530 -0.13849282
## Rear.seat.room      0.1806914 -0.27829245
## Luggage.room        0.2299132 -0.34560414
## Weight              0.3065978  0.10976982

PC1
From the first principal component, I can see that the majority of the coefficients are positive, except for 4 negative ones (MPG.city, MPG.highway, RPM, Rev.per.mile). All 4 negative coefficients are rates of some sort, such as miles per gallon in the city or on the highway, and revolutions per minute or per mile. The largest coefficient in magnitude is Weight, while the smallest is RPM.

PC2
In the second principal component, the coefficients vary more than in the first, with the largest coefficient in magnitude being RPM and the smallest being EngineSize. More coefficients are close to 0, indicating that much of the variation in those predictors was already captured by the first component. In addition, the negative coefficients in the second component all relate to distance or size, but the grouping is not as clear as in the first component, which also suggests the variation captured here is lower than in the first component.
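The magnitude comparisons above can be read off directly by sorting the absolute loadings. A minimal standalone sketch, using the built-in mtcars data as a stand-in (pca_cars$rotation from Part A slots in the same way):

```r
# Sort the absolute PC1 loadings to find the largest- and
# smallest-magnitude coefficients (mtcars as a stand-in dataset).
pca <- prcomp(mtcars, scale. = TRUE)
l1 <- sort(abs(pca$rotation[, 1]), decreasing = TRUE)

print(head(l1, 1))  # largest |coefficient| on PC1
print(tail(l1, 1))  # smallest |coefficient| on PC1
```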

Part D

print(summary(pca_cars))
## Importance of components:
##                           PC1    PC2     PC3     PC4    PC5    PC6     PC7
## Standard deviation     3.1704 1.2840 0.97225 0.76321 0.6504 0.5732 0.53562
## Proportion of Variance 0.6701 0.1099 0.06302 0.03883 0.0282 0.0219 0.01913
## Cumulative Proportion  0.6701 0.7800 0.84304 0.88187 0.9101 0.9320 0.95110
##                            PC8     PC9    PC10    PC11    PC12    PC13    PC14
## Standard deviation     0.44946 0.41551 0.34330 0.26369 0.26018 0.20809 0.19842
## Proportion of Variance 0.01347 0.01151 0.00786 0.00464 0.00451 0.00289 0.00262
## Cumulative Proportion  0.96457 0.97608 0.98393 0.98857 0.99308 0.99597 0.99859
##                           PC15
## Standard deviation     0.14522
## Proportion of Variance 0.00141
## Cumulative Proportion  1.00000
plot(pca_cars, type="lines")

I. Kaiser Criterion: For scaled PCA, keep every principal component whose standard deviation (equivalently, variance) is at least 1. By this rule, the first 2 principal components should be kept.

II. Cumulative Proportion: The first 4 principal components capture only about 88% of the variation, so to retain at least 90%, the first 5 principal components must be kept.

III. Scree Plot: The scree plot appears to plateau around the 3rd component: component 3 is not much different from 4, and 4 is not much different from 5. Thus, based on the scree plot, the first 2 principal components should be kept.
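The first two rules can also be applied programmatically. A standalone sketch on the built-in mtcars data (the same calls apply to pca_cars):

```r
pca <- prcomp(mtcars, scale. = TRUE)

# I. Kaiser: since the data were scaled, keep components whose
#    standard deviation (and hence variance) is at least 1.
kaiser_k <- sum(pca$sdev >= 1)

# II. Cumulative proportion: smallest k capturing at least 90%.
cumprop <- cumsum(pca$sdev^2) / sum(pca$sdev^2)
cumprop_k <- which(cumprop >= 0.90)[1]

# III. Scree plot: inspect visually for the elbow.
plot(pca, type = "lines")

print(c(kaiser = kaiser_k, cumulative_90 = cumprop_k))
```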

Part E

I.

predictors_pca <- as.data.frame(pca_cars$x[, 1:2])
response <- ifelse(data$Type == "Small", "Small", "Not Small")

library(MASS)  # provides lda()
lda_cars <- lda(response ~ ., data = data.frame(response = response, predictors_pca), CV = TRUE)

y_1 <- ifelse(response == 'Not Small', 1, 0)
y_2 <- ifelse(response == 'Small', 1, 0)

y <- cbind(y_1, y_2)

log_product <- y * log(lda_cars$posterior)

log_loss <- -mean(log_product[log_product != 0])
print(log_loss)
## [1] 0.2216978

II.

predictors_pca <- as.data.frame(pca_cars$x[, 1:2])
response <- data$Type

lda_cars <- lda(response ~ ., data = data.frame(response = response, predictors_pca), CV = TRUE)

y_1 <- ifelse(response == 'Compact', 1, 0)
y_2 <- ifelse(response == 'Large', 1, 0)
y_3 <- ifelse(response == 'Midsize', 1, 0)
y_4 <- ifelse(response == 'Small', 1, 0)
y_5 <- ifelse(response == 'Sporty', 1, 0)

y <- cbind(y_1, y_2, y_3, y_4, y_5)

log_product <- y * log(lda_cars$posterior)

log_loss <- -mean(log_product[log_product != 0])

print(log_loss)
## [1] 0.8125036

Part F

Based on the above, the log-loss with 2 categories of vehicle type (“Small” and “Not Small”) is smaller than when we included all vehicle types. It is reasonable to expect that classifying among more categories requires a better model or more understanding of the groups/data. However, we used the same model and the same number of principal components (i.e., the same amount of captured variation), and the log-loss in the second case was worse because the log-loss metric heavily penalizes highly confident misclassifications.

This matches our discussion, which concluded that there is no guarantee that higher-variance components lead to more predictive power for y. We also don’t know which principal components explain the variation in y, so even though we only used the first 2 principal components, we might achieve a lower log-loss using other components that capture less variance.
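The penalty for confident mistakes is easy to see on a toy example (probabilities invented for illustration): the per-observation loss is -log of the probability assigned to the true class.

```r
# Loss for one observation whose true class receives probability p.
p <- c(confident_right = 0.99, unsure = 0.50, confident_wrong = 0.01)
losses <- -log(p)
print(round(losses, 3))  # 0.010 0.693 4.605
```

A single confident misclassification (-log(0.01) ≈ 4.6) outweighs hundreds of confident correct calls, which is why a few bad 5-class predictions can dominate the mean.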

Question 2

Part A

data <- banknote
print(summary(data[,-1:-2]))
##       Left           Right           Bottom            Top       
##  Min.   :129.0   Min.   :129.0   Min.   : 7.200   Min.   : 7.70  
##  1st Qu.:129.9   1st Qu.:129.7   1st Qu.: 8.200   1st Qu.:10.10  
##  Median :130.2   Median :130.0   Median : 9.100   Median :10.60  
##  Mean   :130.1   Mean   :130.0   Mean   : 9.418   Mean   :10.65  
##  3rd Qu.:130.4   3rd Qu.:130.2   3rd Qu.:10.600   3rd Qu.:11.20  
##  Max.   :131.0   Max.   :131.1   Max.   :12.700   Max.   :12.30  
##     Diagonal    
##  Min.   :137.8  
##  1st Qu.:139.5  
##  Median :140.4  
##  Mean   :140.5  
##  3rd Qu.:141.5  
##  Max.   :142.4
print(cor(data[,-1:-2]))
##                Left      Right     Bottom        Top   Diagonal
## Left      1.0000000  0.7432628  0.4137810  0.3623496 -0.5032290
## Right     0.7432628  1.0000000  0.4867577  0.4006702 -0.5164755
## Bottom    0.4137810  0.4867577  1.0000000  0.1418513 -0.6229827
## Top       0.3623496  0.4006702  0.1418513  1.0000000 -0.5940446
## Diagonal -0.5032290 -0.5164755 -0.6229827 -0.5940446  1.0000000

Based on the summary of the data, it must be scaled because the magnitudes of Bottom and Top are much lower than those of the other predictors.

Scaled Euclidean distance seems appropriate due to its simplicity. Scaled Manhattan distance seemed appealing because the values are edge measurements of the banknote (left, right, top, and bottom), but I wasn’t sure the travel-along-the-axes interpretation applied to the banknote data. Mahalanobis distance was considered because of the correlation between the Left and Right predictors, but was dropped because the correlation among the other predictors was not as high, particularly between Bottom and Top.
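The three candidate metrics can be compared side by side. A standalone sketch on simulated data (the scaled banknote columns would slot in the same way):

```r
set.seed(1)
X <- scale(matrix(rnorm(40), nrow = 8, ncol = 5))

# Pairwise Euclidean and Manhattan distances between rows.
d_euc <- as.matrix(dist(X, method = "euclidean"))
d_man <- as.matrix(dist(X, method = "manhattan"))

# Mahalanobis distance of each row from the column means; the sample
# covariance is what accounts for correlated predictors.
d_mah <- mahalanobis(X, center = colMeans(X), cov = cov(X))

print(c(euclidean = d_euc[1, 2], manhattan = d_man[1, 2]))
print(round(d_mah[1:2], 3))
```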

Part B

scaled_data <- scale(data[,-1:-2])
dist_matrix <- dist(scaled_data, method = "euclidean")
hc_single <- hclust(dist_matrix, method = "single")
hc_complete <- hclust(dist_matrix, method = "complete")
hc_avg <- hclust(dist_matrix, method = "average")

plot(hc_single, hang = -1, labels = rownames(data))

plot(hc_complete, hang = -1, labels = rownames(data))

plot(hc_avg, hang = -1, labels = rownames(data))

Part C

The dendrogram from the “complete” linkage method looks the best, and the dendrogram from the “average” linkage method looks somewhat acceptable. The “single” linkage dendrogram looks strange: a chaining pattern appears to be occurring.

The “complete” dendrogram indicates there are 2 groups and the “average” dendrogram indicates there are likely 3 groups, but I would ultimately go with the “complete” linkage method because its dendrogram looked the best, and conclude there are likely 2 groups in the data.
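One way to make the two-group reading concrete is to box the 2-cluster cut on the dendrogram with rect.hclust. A standalone sketch on simulated data (hc_complete from Part B slots in directly):

```r
set.seed(1)
X <- rbind(matrix(rnorm(50, mean = 0), ncol = 5),   # group 1: 10 rows
           matrix(rnorm(50, mean = 4), ncol = 5))   # group 2: 10 rows
hc <- hclust(dist(scale(X)), method = "complete")

plot(hc, hang = -1)
rect.hclust(hc, k = 2, border = "red")  # draw boxes around the 2 clusters
groups <- cutree(hc, k = 2)
print(table(groups))
```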

Part D

clusters <- cutree(hc_complete, k=2)
# data$cluster <- as.factor(clusters)

conf_matrix <- table(data$Status, clusters)
print(conf_matrix)
##              clusters
##                 1   2
##   counterfeit 100   0
##   genuine      30  70
misclassification_rate <- 1 - sum(diag(conf_matrix)) / sum(conf_matrix)
print(misclassification_rate)
## [1] 0.15

Part E

set.seed(632)

k_means <- kmeans(scaled_data, 2, nstart=25)
conf_matrix <- table(data$Status, k_means$cluster)
print(conf_matrix)
##              
##                 1   2
##   counterfeit 100   0
##   genuine       8  92
misclassification_rate <- 1 - sum(diag(conf_matrix)) / sum(conf_matrix)
print(misclassification_rate)
## [1] 0.04

Part F

set.seed(632)

k_means <- kmeans(data[,-1:-2], 2, nstart=25)
conf_matrix <- table(data$Status, k_means$cluster)
print(conf_matrix)
##              
##                 1   2
##   counterfeit   0 100
##   genuine     100   0
n <- nrow(conf_matrix)
opposite_diag_values <- conf_matrix[cbind(n:1, 1:n)]

misclassification_rate <- 1 - sum(opposite_diag_values) / sum(conf_matrix)
print(misclassification_rate)
## [1] 0
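Since k-means cluster labels are arbitrary, the anti-diagonal indexing above is one way to re-align them; a label-agnostic version takes the better of the two possible alignments (a sketch for the 2x2 case, using the confusion table from this part):

```r
conf <- matrix(c(0, 100, 100, 0), nrow = 2)  # the Part F confusion table
aligned <- sum(diag(conf))                   # labels as-is
swapped <- sum(conf[cbind(2:1, 1:2)])        # labels flipped
err <- 1 - max(aligned, swapped) / sum(conf)
print(err)  # 0
```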

A potential reason the unscaled data performs better than the scaled version is that the raw spread of particular predictors may itself be informative for clustering. For example, Bottom and Top vary over a much wider range in the original units than Left and Right; after standardization every predictor has variance 1, so that extra spread (and any class separation it carries) becomes less prominent.
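This can be checked directly by comparing per-column variances before and after scaling. A standalone sketch with simulated stand-ins for two banknote columns (apply(data[,-1:-2], 2, var) gives the real values):

```r
set.seed(1)
# Simulated stand-ins: Diagonal spreads much more than Left in raw units.
raw <- data.frame(Left     = rnorm(100, mean = 130.0, sd = 0.35),
                  Diagonal = rnorm(100, mean = 140.5, sd = 1.15))
print(apply(raw, 2, var))     # raw variances differ by roughly 10x

# After scale(), every column has variance exactly 1, so any separation
# carried by the wider-spread column is down-weighted.
scaled <- scale(raw)
print(apply(scaled, 2, var))  # both exactly 1
```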

Part G

The strong performance of unsupervised techniques on the banknote data set indicates there is an underlying structure/pattern within the data which allowed the unsupervised technique to be so successful. The strong performance also indicates there are important features which can be used for anomaly detection which can be used to determine if certain banknotes deviate from the norm. In general, the strong performance signifies some relationship which can be further explored/analyzed.

Question 3

Part A

load("lots.Rdata")

data <- data.frame(datmat)
data$clusts <- as.factor(clusts)

library(ggplot2)

plot <- ggplot(data, aes(x=X1, y=X2, color=clusts)) +
  geom_point()

print(plot)

Part B

set.seed(1026)

scaled_data <- scale(data[,-3])

k_means <- kmeans(scaled_data, 20)
library(mclust)  # provides adjustedRandIndex()
adj_rand_index <- adjustedRandIndex(as.numeric(data$clusts), as.numeric(k_means$cluster))
print(adj_rand_index)
## [1] 0.7748961

Part C

set.seed(6201)

scaled_data <- scale(data[,-3])

k_means <- kmeans(scaled_data, 20)
adj_rand_index <- adjustedRandIndex(as.numeric(data$clusts), as.numeric(k_means$cluster))
print(adj_rand_index)
## [1] 0.8971005

Part D

set.seed(1026)

scaled_data <- scale(data[,-3])

k_means <- kmeans(scaled_data, 20, nstart=1000)
adj_rand_index <- adjustedRandIndex(as.numeric(data$clusts), as.numeric(k_means$cluster))
print(adj_rand_index)
## [1] 0.9567124

Part E

set.seed(6201)

scaled_data <- scale(data[,-3])

k_means <- kmeans(scaled_data, 20, nstart=1000)
adj_rand_index <- adjustedRandIndex(as.numeric(data$clusts), as.numeric(k_means$cluster))
print(adj_rand_index)
## [1] 0.9641084

Part F

Based on the results from B and C, the adjusted Rand indices are not very close, which indicates the random start plays an important role in the clusters formed by the k-means algorithm. More importantly, it also raises the question of whether our chosen k of 20 is actually correct, because the solutions seem to differ from the original labels.

Moving to the results from parts D and E, the adjusted Rand indices are very close, indicating the initial start matters less and that, with many restarts, we eventually reach similar groups. This also reinforces that 20 groups is a valid choice and suggests the clusters formed via k-means are very similar to the original groups.
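The nstart effect can be reproduced on simulated data with 20 true clusters: kmeans keeps the start with the lowest tot.withinss, so many starts rarely get trapped in a poor local optimum (a sketch with invented centers, not the lots.Rdata values):

```r
set.seed(1026)
centers <- matrix(runif(40, 0, 20), ncol = 2)  # 20 invented cluster centers
X <- centers[rep(1:20, each = 50), ] +
  matrix(rnorm(2000, sd = 0.3), ncol = 2)      # 50 points per center

single_start <- kmeans(X, 20, nstart = 1)$tot.withinss
many_starts  <- kmeans(X, 20, nstart = 50)$tot.withinss
print(c(single = single_start, best_of_50 = many_starts))
```

A single start typically merges some true clusters and splits others; the best of 50 starts almost always achieves an equal or lower within-cluster sum of squares.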

Question 4

load('bsim.Rdata')

data <- data.frame(asim)

# initial data exploration
plot(data)

print(cor(data[,-1]))
##              V2         V3          V4         V5          V6         V7
## V2   1.00000000 -0.7176512 -0.02516521  0.7073009  0.22288149  0.8288672
## V3  -0.71765119  1.0000000 -0.15587899 -0.6831011 -0.22714898 -0.7976255
## V4  -0.02516521 -0.1558790  1.00000000  0.1312994 -0.07314724  0.1521966
## V5   0.70730092 -0.6831011  0.13129941  1.0000000  0.24767181  0.8086329
## V6   0.22288149 -0.2271490 -0.07314724  0.2476718  1.00000000  0.2618375
## V7   0.82886719 -0.7976255  0.15219662  0.8086329  0.26183747  1.0000000
## V8   0.76762981 -0.6740252  0.04290202  0.6894926  0.20735075  0.7853305
## V9  -0.69315038  0.7150549 -0.17721595 -0.7314435 -0.28115894 -0.8374997
## V10 -0.47654425  0.4520752 -0.21329907 -0.4296194 -0.19552181 -0.5146011
##              V8         V9        V10
## V2   0.76762981 -0.6931504 -0.4765443
## V3  -0.67402515  0.7150549  0.4520752
## V4   0.04290202 -0.1772159 -0.2132991
## V5   0.68949263 -0.7314435 -0.4296194
## V6   0.20735075 -0.2811589 -0.1955218
## V7   0.78533051 -0.8374997 -0.5146011
## V8   1.00000000 -0.6662969 -0.4336191
## V9  -0.66629686  1.0000000  0.4767139
## V10 -0.43361914  0.4767139  1.0000000
print(cov(data[,-1]))
##              V2         V3         V4         V5         V6         V7
## V2   15.5514796 -11.871511 -0.2585403  15.149309  1.9420930  22.744933
## V3  -11.8715114  17.595982 -1.7034767 -15.563041 -2.1053666 -23.281964
## V4   -0.2585403  -1.703477  6.7870924   1.857836 -0.4210659   2.759058
## V5   15.1493088 -15.563041  1.8578361  29.498877  2.9722784  30.561033
## V6    1.9420930  -2.105367 -0.4210659   2.972278  4.8822539   4.025834
## V7   22.7449333 -23.281964  2.7590575  30.561033  4.0258338  48.420327
## V8    7.0193204  -6.556019  0.2591654   8.683399  1.0623636  12.671374
## V9  -11.3442795  12.448291 -1.9160564 -16.487208 -2.5782539 -24.185891
## V10  -4.8186775   4.862461 -1.4248515  -5.983088 -1.1077561  -9.181704
##             V8         V9       V10
## V2   7.0193204 -11.344280 -4.818677
## V3  -6.5560195  12.448291  4.862461
## V4   0.2591654  -1.916056 -1.424852
## V5   8.6833988 -16.487208 -5.983088
## V6   1.0623636  -2.578254 -1.107756
## V7  12.6713740 -24.185891 -9.181704
## V8   5.3766870  -6.411930 -2.578129
## V9  -6.4119296  17.223729  5.072946
## V10 -2.5781287   5.072946  6.574724
# initial linear model
initial_linear <- lm(y ~. , data=data)
print(summary(initial_linear))
## 
## Call:
## lm(formula = y ~ ., data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -27.8122  -4.4968   0.1784   4.4572  27.3810 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -37.2355     5.2878  -7.042 6.45e-12 ***
## V2            2.3483     0.1704  13.785  < 2e-16 ***
## V3            1.5074     0.1369  11.009  < 2e-16 ***
## V4           -0.5731     0.1406  -4.076 5.34e-05 ***
## V5           -2.2136     0.1080 -20.491  < 2e-16 ***
## V6            1.8948     0.1603  11.822  < 2e-16 ***
## V7            1.6712     0.1334  12.526  < 2e-16 ***
## V8           -1.7447     0.2503  -6.971 1.02e-11 ***
## V9           -0.3703     0.1528  -2.423   0.0158 *  
## V10           3.2300     0.1573  20.530  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.473 on 490 degrees of freedom
## Multiple R-squared:   0.74,  Adjusted R-squared:  0.7352 
## F-statistic: 154.9 on 9 and 490 DF,  p-value: < 2.2e-16
# PCA
pca_data <- prcomp(data[,-1], scale.=TRUE)
biplot(pca_data)

print(summary(pca_data))
## Importance of components:
##                           PC1    PC2     PC3     PC4     PC5     PC6     PC7
## Standard deviation     2.2679 1.0543 0.94114 0.79362 0.57914 0.56492 0.50627
## Proportion of Variance 0.5715 0.1235 0.09842 0.06998 0.03727 0.03546 0.02848
## Cumulative Proportion  0.5715 0.6950 0.79340 0.86338 0.90065 0.93611 0.96458
##                            PC8     PC9
## Standard deviation     0.46400 0.32163
## Proportion of Variance 0.02392 0.01149
## Cumulative Proportion  0.98851 1.00000
pca_predictors <- as.data.frame(pca_data$x[, 1:2])

pca_predictors <- data.frame(data[,1], pca_predictors)
colnames(pca_predictors)[colnames(pca_predictors) == 'data...1.'] <- 'y'

linear <- lm(y ~., data=pca_predictors)
summary(linear)
## 
## Call:
## lm(formula = y ~ ., data = pca_predictors)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -41.466  -8.128   0.716   7.981  45.308 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  44.7747     0.5733  78.103  < 2e-16 ***
## PC1           1.1609     0.2530   4.588 5.68e-06 ***
## PC2          -6.0199     0.5443 -11.060  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.82 on 497 degrees of freedom
## Multiple R-squared:  0.2239, Adjusted R-squared:  0.2208 
## F-statistic: 71.68 on 2 and 497 DF,  p-value: < 2.2e-16
# Clustering
scaled_data <- data.frame(scale(data[,-1], scale=TRUE))

# Decide to go with 2 groups based on biplot from PCA
k_means <- kmeans(scaled_data, 2, nstart=1000)

temp_data <- data
temp_data$cluster <- k_means$cluster
linear_clusters <- lm(y ~ cluster*(V2+V3+V4+V5+V6+V7+V8+V9+V10) , data=temp_data)
print(summary(linear_clusters))
## 
## Call:
## lm(formula = y ~ cluster * (V2 + V3 + V4 + V5 + V6 + V7 + V8 + 
##     V9 + V10), data = temp_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.99101 -0.59457  0.03553  0.61065  2.98516 
## 
## Coefficients:
##             Estimate Std. Error  t value Pr(>|t|)    
## (Intercept) -5.31013    2.70073   -1.966   0.0499 *  
## cluster      7.56585    1.76602    4.284 2.22e-05 ***
## V2           7.62656    0.08841   86.261  < 2e-16 ***
## V3          -1.72147    0.05798  -29.689  < 2e-16 ***
## V4          -6.59587    0.06230 -105.868  < 2e-16 ***
## V5          -2.07703    0.04572  -45.433  < 2e-16 ***
## V6           2.08090    0.06557   31.733  < 2e-16 ***
## V7           0.11227    0.07939    1.414   0.1580    
## V8          -1.23255    0.10343  -11.917  < 2e-16 ***
## V9          -7.15220    0.08045  -88.907  < 2e-16 ***
## V10          2.59536    0.06502   39.914  < 2e-16 ***
## cluster:V2  -5.11942    0.07260  -70.516  < 2e-16 ***
## cluster:V3   1.90348    0.04313   44.138  < 2e-16 ***
## cluster:V4   4.27634    0.04067  105.159  < 2e-16 ***
## cluster:V5  -0.02066    0.03426   -0.603   0.5468    
## cluster:V6  -0.43149    0.04341   -9.940  < 2e-16 ***
## cluster:V7   0.89608    0.05052   17.737  < 2e-16 ***
## cluster:V8  -0.73147    0.07931   -9.223  < 2e-16 ***
## cluster:V9   4.95254    0.05917   83.704  < 2e-16 ***
## cluster:V10 -0.23190    0.04657   -4.979 8.92e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9774 on 480 degrees of freedom
## Multiple R-squared:  0.9956, Adjusted R-squared:  0.9955 
## F-statistic:  5773 on 19 and 480 DF,  p-value: < 2.2e-16

The initial linear model has an R-squared value of about 0.74, well below the 0.99 target. Performing PCA and principal components regression (PCReg), we see the PCReg model performs much worse, with an R-squared of only 0.22; however, from the biplot, we can clearly see 2 groups.

After performing k-means clustering with 2 groups, I re-fit the linear model, this time interacting each predictor with the cluster label, which results in a much better fit and achieves the 0.99 R-squared target.

Question 5

Question 5 solution